DDS3D: Dense Pseudo-Labels with Dynamic Threshold for Semi-Supervised 3D Object Detection
In this paper, we present a simple yet effective semi-supervised 3D object
detector named DDS3D. Our main contributions are two-fold. On the one hand,
unlike previous works that use Non-Maximum Suppression (NMS) or its variants
to obtain sparse pseudo-labels, we propose a dense pseudo-label generation
strategy that retains more potential supervision information for the student
network. On the other hand, instead of a traditional fixed threshold, we
propose a dynamic thresholding scheme for generating pseudo-labels, which
guarantees both the quality and the quantity of pseudo-labels throughout
training. Benefiting from these two components, our DDS3D outperforms the
state-of-the-art semi-supervised 3D object detector by 3.1% mAP on pedestrians
and 2.1% on cyclists under the same configuration with 1% labeled samples.
Extensive ablation studies on the KITTI dataset demonstrate the effectiveness
of DDS3D. The code and models will be made publicly available at
https://github.com/hust-jy/DDS3D.
Comment: Accepted for publication in the 2023 IEEE International Conference on
Robotics and Automation (ICRA).
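The abstract does not spell out the threshold schedule, but the idea of trading pseudo-label quantity against quality over training can be sketched as follows; the linear ramp, its endpoints, and the helper names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamic_threshold(step, total_steps, t_low=0.3, t_high=0.7):
    # Start permissive so early pseudo-labels are plentiful, then tighten
    # as the teacher improves. Linear ramp and endpoints are assumptions.
    progress = min(step / total_steps, 1.0)
    return t_low + (t_high - t_low) * progress

def select_pseudo_labels(scores, step, total_steps):
    # Keep dense teacher predictions whose confidence clears the
    # current (dynamic) threshold.
    return scores >= dynamic_threshold(step, total_steps)

scores = np.array([0.35, 0.55, 0.80])
print(select_pseudo_labels(scores, step=100, total_steps=10_000))   # [ True  True  True]
print(select_pseudo_labels(scores, step=9_000, total_steps=10_000)) # [False False  True]
```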
Focal Inverse Distance Transform Maps for Crowd Localization and Counting in Dense Crowd
In this paper, we propose a novel map for dense crowd localization and crowd
counting. Most crowd counting methods utilize convolutional neural networks (CNNs)
to regress a density map, achieving significant progress recently. However,
these regression-based methods are often unable to provide a precise location
for each person, for two crucial reasons: 1) the density map consists
of a series of blurry Gaussian blobs, 2) severe overlaps exist in the dense
region of the density map. To tackle this issue, we propose a novel Focal
Inverse Distance Transform (FIDT) map for crowd localization and counting.
Compared with density maps, FIDT maps accurately describe people's
locations, without overlap between nearby heads in dense regions. We
simultaneously implement crowd localization and counting by regressing the FIDT
map. Extensive experiments demonstrate that the proposed method outperforms
state-of-the-art localization-based methods in crowd localization tasks,
achieving very competitive performance compared with the regression-based
methods in counting tasks. In addition, the proposed method presents strong
robustness to negative samples and extremely dense scenes, which further
verifies the effectiveness of the FIDT map. The code and models are available
at https://github.com/dk-liang/FIDTM.
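For readers unfamiliar with the map itself, here is a minimal sketch of building an FIDT-style map from point annotations, using a focal inverse form 1 / (d^(alpha*d + beta) + C) over the Euclidean distance transform; the constants used below (alpha = 0.02, beta = 0.75, C = 1) are assumed defaults and may differ from the released code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fidt_map(points, shape, alpha=0.02, beta=0.75, c=1.0):
    # points: iterable of (row, col) head annotations.
    mask = np.ones(shape, dtype=bool)
    for r, col in points:
        mask[r, col] = False              # zeros at annotated heads
    d = distance_transform_edt(mask)      # distance to the nearest head
    # Focal inverse form: sharp peak of 1/c at each head, fast decay away,
    # so nearby heads stay separable even in dense regions.
    return 1.0 / (np.power(d, alpha * d + beta) + c)

fidt = fidt_map([(10, 10), (12, 14)], shape=(32, 32))
print(fidt[10, 10], fidt.max())  # peak value 1.0 exactly at annotated heads
```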
LATFormer: Locality-Aware Point-View Fusion Transformer for 3D Shape Recognition
Recently, 3D shape understanding has achieved significant progress due to the
advances of deep learning models on various data formats like images, voxels,
and point clouds. Among them, point clouds and multi-view images are two
complementary modalities of 3D objects and learning representations by fusing
both of them has been proven to be fairly effective. While prior works
typically focus on exploiting global features of the two modalities, herein we
argue that more discriminative features can be derived by modeling "where to
fuse". To investigate this, we propose a novel Locality-Aware Point-View
Fusion Transformer (LATFormer) for 3D shape retrieval and classification. The
core component of LATFormer is a module named Locality-Aware Fusion (LAF) which
integrates the local features of correlated regions across the two modalities
based on the co-occurrence scores. We further propose to filter out scores with
low values to obtain salient local co-occurring regions, which reduces
redundancy for the fusion process. In our LATFormer, we utilize the LAF module
to fuse the multi-scale features of the two modalities both bidirectionally and
hierarchically to obtain more informative features. Comprehensive experiments
on four popular 3D shape benchmarks covering 3D object retrieval and
classification validate its effectiveness.
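As a rough illustration of "where to fuse", the sketch below scores co-occurrence between local features of the two modalities with cosine similarity, keeps only the top-scoring regions, and fuses them with attention weights; the cosine scoring and top-k filtering are stand-in assumptions, not the exact LAF module.

```python
import torch
import torch.nn.functional as F

def locality_aware_fusion(point_feats, view_feats, keep_ratio=0.5):
    # point_feats: (Np, C) local point-cloud features
    # view_feats:  (Nv, C) local multi-view image features
    scores = F.normalize(point_feats, dim=-1) @ F.normalize(view_feats, dim=-1).T  # (Np, Nv)
    k = max(1, int(scores.shape[1] * keep_ratio))
    topk, idx = scores.topk(k, dim=-1)        # keep salient co-occurring regions
    weights = topk.softmax(dim=-1)            # attention over the kept regions
    gathered = view_feats[idx]                # (Np, k, C)
    # Fuse: each point token absorbs its correlated view regions.
    return point_feats + (weights.unsqueeze(-1) * gathered).sum(dim=1)

fused = locality_aware_fusion(torch.randn(64, 256), torch.randn(196, 256))
print(fused.shape)  # torch.Size([64, 256])
```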
SOOD: Towards Semi-Supervised Oriented Object Detection
Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for
boosting object detectors, has become an active task in recent years. However,
existing SSOD approaches mainly focus on horizontal objects, leaving
multi-oriented objects that are common in aerial images unexplored. This paper
proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD,
built upon the mainstream pseudo-labeling framework. Towards oriented objects
in aerial scenes, we design two loss functions to provide better supervision.
Focusing on the orientations of objects, the first loss regularizes the
consistency of each pseudo-label-prediction pair (consisting of a prediction
and its corresponding pseudo-label) with adaptive weights based on their
orientation gap. Focusing on the layout of an image, the second loss
regularizes the similarity between the sets of pseudo-labels and predictions,
explicitly building a many-to-many relation between them. Such a global consistency
constraint can further boost semi-supervised learning. Our experiments show
that when trained with the two proposed losses, SOOD surpasses the
state-of-the-art SSOD methods under various settings on the DOTA-v1.5
benchmark. The code will be available at https://github.com/HamPerdredes/SOOD.
Comment: Accepted to CVPR 2023.
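A hedged sketch of the first loss's spirit: weight a per-pair consistency term by the orientation gap between prediction and pseudo-label, so pairs that disagree more on orientation contribute more. The weighting form and the smooth-L1 box term below are illustrative assumptions; see the paper and repository for the actual losses.

```python
import torch
import torch.nn.functional as F

def rotation_aware_consistency(pred_angles, pseudo_angles, pred_boxes, pseudo_boxes):
    # Angles in radians; boxes as (N, 5) oriented parameterizations.
    # Wrap the angular difference into [0, pi].
    gap = torch.abs(torch.atan2(torch.sin(pred_angles - pseudo_angles),
                                torch.cos(pred_angles - pseudo_angles)))
    weights = 1.0 + gap / torch.pi  # adaptive per-pair weight (assumed form)
    box_loss = F.smooth_l1_loss(pred_boxes, pseudo_boxes,
                                reduction='none').mean(dim=-1)
    return (weights * box_loss).mean()

loss = rotation_aware_consistency(torch.rand(8), torch.rand(8),
                                  torch.rand(8, 5), torch.rand(8, 5))
print(loss)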
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Supervised crowd counting relies heavily on costly manual labeling, which is
difficult and expensive, especially in dense scenes. To alleviate the problem,
we propose a novel unsupervised framework for crowd counting, named CrowdCLIP.
The core idea is built on two observations: 1) the recent contrastive
pre-trained vision-language model (CLIP) has presented impressive performance
on various downstream tasks; 2) there is a natural mapping between crowd
patches and count text. To the best of our knowledge, CrowdCLIP is the first to
investigate vision-language knowledge for solving the counting problem.
Specifically, in the training stage, we exploit a multi-modal ranking loss,
constructing ranking text prompts that match size-sorted crowd patches to
guide the learning of the image encoder. In the testing stage, to deal with the
diversity of image patches, we propose a simple yet effective progressive
filtering strategy that first selects the most likely crowd patches and then
maps them into the language space with various counting intervals. Extensive
experiments on five challenging datasets demonstrate that the proposed
CrowdCLIP achieves superior performance compared to previous unsupervised
state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some
popular fully-supervised methods under the cross-dataset setting. The source
code will be available at https://github.com/dk-liang/CrowdCLIP.
Comment: Accepted by CVPR 2023.
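To make the ranking idea concrete, here is a minimal sketch: given CLIP embeddings of crowd patches sorted by size and of count prompts sorted by magnitude, a margin ranking loss pushes each patch toward its own count prompt. The margin form, and using plain random tensors in place of real CLIP embeddings, are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def ranking_prompt_loss(patch_embeds, prompt_embeds, margin=0.1):
    # patch_embeds:  (N, D) embeddings of crowd patches sorted by size;
    # prompt_embeds: (N, D) embeddings of count prompts sorted by count.
    sim = F.normalize(patch_embeds, dim=-1) @ F.normalize(prompt_embeds, dim=-1).T
    diag = sim.diagonal()                              # matched pairs
    off = ~torch.eye(sim.shape[0], dtype=torch.bool)   # mismatched pairs
    # Each patch should score higher with its own count prompt than with
    # any other prompt, by at least `margin`.
    return F.relu(margin + sim - diag.unsqueeze(1))[off].mean()

loss = ranking_prompt_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```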
SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
With the development of large language models, many remarkable language
systems like ChatGPT have thrived and achieved astonishing success on many
tasks, showing the incredible power of foundation models. In the spirit of
unleashing the capability of foundation models on vision tasks, the Segment
Anything Model (SAM), a vision foundation model for image segmentation, has
been proposed recently and presents strong zero-shot ability on many downstream
2D tasks. However, whether SAM can be adapted to 3D vision tasks has yet to be
explored, especially 3D object detection. With this inspiration, we explore
adapting the zero-shot ability of SAM to 3D object detection in this paper. We
propose a SAM-powered BEV processing pipeline to detect objects and get
promising results on the large-scale Waymo open dataset. As an early attempt,
our method takes a step toward 3D object detection with vision foundation
models and presents the opportunity to unleash their power on 3D vision tasks.
The code is released at https://github.com/DYZhang09/SAM3D.
Comment: Technical Report.
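A simple sketch of the BEV side of such a pipeline: rasterize the LiDAR cloud into an intensity image that a 2D segmenter like SAM can consume, then convert each output mask back into a BEV box. The ranges and resolution below are assumed KITTI-style values rather than SAM3D's exact settings, and lifting boxes to 3D (e.g., with a height prior) is omitted.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    # points: (N, 4) array of x, y, z, intensity. Rasterize into a BEV
    # intensity image for a 2D segmenter (e.g., SAM).
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]
    rows = ((pts[:, 1] - y_range[0]) / res).astype(int)
    cols = ((pts[:, 0] - x_range[0]) / res).astype(int)
    h = int((y_range[1] - y_range[0]) / res)
    w = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)
    np.maximum.at(bev, (rows, cols), pts[:, 3])  # max intensity per cell
    return bev

def mask_to_bev_box(mask):
    # Convert one binary segmentation mask into an axis-aligned BEV box.
    rows, cols = np.nonzero(mask)
    return cols.min(), rows.min(), cols.max(), rows.max()

bev = lidar_to_bev(np.random.rand(1000, 4) * [70.0, 40.0, 3.0, 1.0])
print(bev.shape)  # (800, 704)
```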
Diffusion-based 3D Object Detection with Random Boxes
3D object detection is an essential task for achieving autonomous driving.
Existing anchor-based detection methods rely on empirically set heuristic
anchors, which makes these algorithms inelegant. In recent years, we have
witnessed the rise of several generative models, among which diffusion models
show great potential for learning the transformation between two distributions. Our
proposed Diff3Det migrates the diffusion model to proposal generation for 3D
object detection by considering the detection boxes as generative targets.
During training, the object boxes diffuse from the ground truth boxes to the
Gaussian distribution, and the decoder learns to reverse this noise process. In
the inference stage, the model progressively refines a set of random boxes to
the prediction results. We provide detailed experiments on the KITTI benchmark
and achieve promising performance compared to classical anchor-based 3D
detection methods.
Comment: Accepted by PRCV 2023.
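The forward corruption that the decoder learns to reverse can be sketched in a few lines: ground-truth box parameters drift toward a standard Gaussian under a noise schedule, exactly as in standard DDPM training. The linear schedule and box normalization here are generic assumptions, not Diff3Det's exact choices.

```python
import torch

def diffuse_boxes(gt_boxes, t, num_steps=1000):
    # gt_boxes: (N, D) normalized box parameters; t: step in [0, num_steps).
    betas = torch.linspace(1e-4, 0.02, num_steps)      # linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]   # cumulative signal rate
    noise = torch.randn_like(gt_boxes)
    # x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    noisy = alpha_bar.sqrt() * gt_boxes + (1 - alpha_bar).sqrt() * noise
    return noisy, noise  # the decoder learns to reverse this corruption

noisy, eps = diffuse_boxes(torch.rand(16, 7), t=500)
print(noisy.shape)  # torch.Size([16, 7])
```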
Fault Diagnosis of Main Pump in Converter Station Based on Deep Neural Network
As the core component of the valve cooling system in a converter station, the main pump plays a major role in ensuring the stable operation of the valve. Accurate and efficient fault diagnosis of the main pump from vibration signals is therefore of positive significance for detecting failing equipment and reducing maintenance costs. This paper proposes a new neural network, based on the vibration signals of the main pump, that classifies four fault states and one normal state of the main pump; it consists of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Multi-scale features are extracted by two CNNs with different kernel sizes, and temporal features are extracted by the LSTM. Moreover, random sampling is used in data processing to handle the imbalanced data, which is meaningful for data symmetry. Experimental results indicate that the network achieves an accuracy of 0.987 on the test set, with average F1-score, recall, and precision of 0.987, 0.987, and 0.988, respectively. The proposed network performs well in multi-label fault diagnosis of the main pump and is superior to other methods.
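A minimal PyTorch sketch of the described architecture: two 1-D convolutional branches with different kernel sizes extract multi-scale features from a vibration segment, an LSTM models their temporal structure, and a linear head classifies the five states. The channel counts, kernel sizes, and signal length are assumed, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleCNNLSTM(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        # Two conv branches with different receptive fields (multi-scale).
        self.branch_small = nn.Sequential(nn.Conv1d(1, 16, kernel_size=3, padding=1),
                                          nn.ReLU(), nn.MaxPool1d(2))
        self.branch_large = nn.Sequential(nn.Conv1d(1, 16, kernel_size=15, padding=7),
                                          nn.ReLU(), nn.MaxPool1d(2))
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, signal_len)
        feats = torch.cat([self.branch_small(x), self.branch_large(x)], dim=1)
        out, _ = self.lstm(feats.transpose(1, 2))  # LSTM over time steps
        return self.head(out[:, -1])               # last hidden state -> logits

logits = MultiScaleCNNLSTM()(torch.randn(8, 1, 1024))
print(logits.shape)  # torch.Size([8, 5])
```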